This document is the first in a series on a project exploring a simple method for pricing car insurance from claims data. This R Markdown document focuses on loading the initial data and performing some systematic data exploration and cleaning.

1 Load Data
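The code chunks are not shown in this rendered output, but the tables match the freMTPL policy and severity data from the CASdatasets package (which appears in the session info at the end). A minimal sketch of the load step, assuming that source and the snake_case column renaming seen below:

```r
library(CASdatasets)   # data source assumed from the session info
library(data.table)

data(freMTPLfreq)  # one row per policy: claim count, exposure, rating factors
data(freMTPLsev)   # one row per claim: policy id and claim amount

# Rename to the snake_case column names used in this document (assumed mapping)
dt_policy <- as.data.table(freMTPLfreq)
setnames(dt_policy,
         c("PolicyID", "ClaimNb", "Exposure", "Power", "CarAge",
           "DriverAge", "Brand", "Gas", "Region", "Density"),
         c("policy_id", "claim_count", "exposure", "power", "car_age",
           "driver_age", "brand", "fuel", "region", "density"))

dt_claim <- as.data.table(freMTPLsev)
setnames(dt_claim, c("PolicyID", "ClaimAmount"), c("policy_id", "claim_amount"))

dt_policy
dt_claim
```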

##         policy_id claim_count   exposure power car_age driver_age
##      1:         1           0 0.09000000     g       0         46
##      2:         2           0 0.84000000     g       0         46
##      3:         3           0 0.52000000     f       2         38
##      4:         4           0 0.45000000     f       2         38
##      5:         5           0 0.15000000     g       0         41
##     ---                                                          
## 413165:    413165           0 0.00273973     j       0         29
## 413166:    413166           0 0.00547945     d       0         29
## 413167:    413167           0 0.00547945     k       0         49
## 413168:    413168           0 0.00273973     d       0         41
## 413169:    413169           0 0.00273973     g       6         29
##                                      brand    fuel region density
##      1: Japanese (except Nissan) or Korean  Diesel    R72      76
##      2: Japanese (except Nissan) or Korean  Diesel    R72      76
##      3: Japanese (except Nissan) or Korean Regular    R31    3003
##      4: Japanese (except Nissan) or Korean Regular    R31    3003
##      5: Japanese (except Nissan) or Korean  Diesel    R52      60
##     ---                                                          
## 413165: Japanese (except Nissan) or Korean  Diesel    R11    2471
## 413166: Japanese (except Nissan) or Korean Regular    R11    5360
## 413167: Japanese (except Nissan) or Korean  Diesel    R11    5360
## 413168: Japanese (except Nissan) or Korean Regular    R11    9850
## 413169: Japanese (except Nissan) or Korean  Diesel    R72      65
##        policy_id claim_amount
##     1:     63987         1172
##     2:    310037         1905
##     3:    314463         1150
##     4:    318713         1220
##     5:    309380        55077
##    ---                       
## 16177:    302759           61
## 16178:    299443         1831
## 16179:    303389         4183
## 16180:    304313          566
## 16181:    206241         2156

2 Initial Data Exploration

Having loaded in the data, we want to look at the basic data types of the columns, along with row and column counts. We also look at a quick summary of the data.

## Observations: 413,169
## Variables: 10
## $ policy_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
## $ claim_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ exposure    <dbl> 0.09, 0.84, 0.52, 0.45, 0.15, 0.75, 0.81, 0.05, 0.76, 0.34, 0.10, 0…
## $ power       <fct> g, g, f, f, g, g, d, d, d, i, f, f, e, e, e, e, e, e, i, i, h, h, j…
## $ car_age     <int> 0, 0, 2, 2, 0, 0, 1, 0, 9, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8…
## $ driver_age  <int> 46, 46, 38, 38, 41, 41, 27, 27, 23, 44, 32, 32, 33, 33, 33, 54, 69,…
## $ brand       <fct> Japanese (except Nissan) or Korean, Japanese (except Nissan) or Kor…
## $ fuel        <fct> Diesel, Diesel, Regular, Regular, Diesel, Diesel, Regular, Regular,…
## $ region      <fct> R72, R72, R31, R31, R52, R52, R72, R72, R31, R11, R24, R24, R11, R1…
## $ density     <int> 76, 76, 3003, 3003, 60, 60, 695, 695, 7887, 27000, 23, 23, 1746, 17…
##    policy_id       claim_count        exposure           power          car_age      
##  Min.   :     1   Min.   :0.0000   Min.   :0.00273   f      :95718   Min.   :  0.00  
##  1st Qu.:103293   1st Qu.:0.0000   1st Qu.:0.20000   g      :91198   1st Qu.:  3.00  
##  Median :206585   Median :0.0000   Median :0.54000   e      :77022   Median :  7.00  
##  Mean   :206585   Mean   :0.0392   Mean   :0.56109   d      :68014   Mean   :  7.53  
##  3rd Qu.:309877   3rd Qu.:0.0000   3rd Qu.:1.00000   h      :26698   3rd Qu.: 12.00  
##  Max.   :413169   Max.   :4.0000   Max.   :1.99000   j      :18038   Max.   :100.00  
##                                                      (Other):36481                   
##    driver_age                                  brand             fuel       
##  Min.   :18.0   Fiat                              : 16723   Diesel :205945  
##  1st Qu.:34.0   Japanese (except Nissan) or Korean: 79060   Regular:207224  
##  Median :44.0   Mercedes, Chrysler or BMW         : 19280                   
##  Mean   :45.3   Opel, General Motors or Ford      : 37402                   
##  3rd Qu.:54.0   other                             :  9866                   
##  Max.   :99.0   Renault, Nissan or Citroen        :218200                   
##                 Volkswagen, Audi, Skoda or Seat   : 32638                   
##      region          density     
##  R24    :160601   Min.   :    2  
##  R11    : 69791   1st Qu.:   67  
##  R53    : 42122   Median :  287  
##  R52    : 38751   Mean   : 1985  
##  R72    : 31329   3rd Qu.: 1410  
##  R31    : 27285   Max.   :27000  
##  (Other): 43290

The categorical variables here are listed as factors, so the first thing I will do is convert them to character strings. Factors can have some strange ‘gotchas’ in how they are used, so it is safer to switch them to character variables at the very start.

NB: I am reversing that decision and will leave these variables as factors for now.

We now create separate vectors for the numerical and categorical variables so we can automatically generate different exploratory plots of the data.
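A sketch of how these vectors might be built, shown here on a toy frame with the same column types (the real code operates on the full policy table):

```r
# Toy frame mimicking the policy table's column types
dt <- data.frame(exposure   = c(0.09, 0.84),
                 driver_age = c(46L, 38L),
                 fuel       = factor(c("Diesel", "Regular")))

# Split column names by type so plots can be generated programmatically
vars_num <- names(dt)[sapply(dt, is.numeric)]
vars_cat <- names(dt)[sapply(dt, is.factor)]
```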

2.1 Univariate Data Exploration

We create simple univariate exploratory plots.

2.1.1 Numeric Variables

We iterate through the numeric variables, looking at a density plot for each one.
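The iteration might look like the following sketch (ggplot2 and toy data stand in for the hidden chunk, which operates on the full table and the `vars_num` vector):

```r
library(ggplot2)

# Toy stand-in for the policy table
dt_policy <- data.frame(exposure   = runif(500),
                        driver_age = sample(18:90, 500, replace = TRUE))

for (var_name in c("exposure", "driver_age")) {
  cat("##", var_name, "\n")   # label printed above each plot, as in the output
  p <- ggplot(dt_policy, aes(x = .data[[var_name]])) +
    geom_density() +
    xlab(var_name)
  print(p)
}
```

Swapping `geom_density()` for `geom_histogram()` gives the histogram versions used below.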

## claim_count

## exposure

## car_age

## driver_age

## density

None of these plots seem very useful, so we try the same thing but now use histograms.

## claim_count

## exposure

## car_age

## driver_age

## density

2.1.2 Categorical Variables

We now iterate through the categorical variables, looking at plots of the counts for each level.
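A sketch of the categorical iteration, using bar charts of the level counts (the exact geometry used in the hidden chunk is not recoverable from this output):

```r
library(ggplot2)

# Toy stand-in with two of the categorical columns
dt_policy <- data.frame(
  fuel  = factor(sample(c("Diesel", "Regular"), 500, replace = TRUE)),
  power = factor(sample(letters[4:15], 500, replace = TRUE))
)

for (var_name in c("power", "fuel")) {
  cat("##", var_name, "\n")
  p <- ggplot(dt_policy, aes(x = .data[[var_name]])) +
    geom_bar() +
    xlab(var_name)
  print(p)
}
```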

## power

## brand

## fuel

## region

2.1.2.1 exposure

The exposure variable is a little odd: having policies whose duration is longer than a year seems unusual for this book of business, but without more knowledge of the business it is hard to tell. Let us see how the exposures longer than one year are distributed.

We could remove these policies, but I am inclined to leave them in for the moment at least. We may need to revisit this decision later.

2.2 Bivariate Data Exploration

We first see how a pairs plot looks. The size of the dataset makes this computationally onerous, so we sample 50,000 data points and create the pairs plot for those.
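A sketch of the sampling and pairs plot, assuming GGally (which appears in the session info) and a toy stand-in for the full table:

```r
library(GGally)

# Toy stand-in; on the real table we would sample 50,000 of the 413,169 rows
dt_policy <- data.frame(exposure   = runif(1000),
                        car_age    = sample(0:20, 1000, replace = TRUE),
                        driver_age = sample(18:90, 1000, replace = TRUE))

set.seed(42)  # hypothetical seed for reproducibility
dt_sample <- dt_policy[sample(nrow(dt_policy), 200), ]

print(ggpairs(dt_sample))
```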

2.2.1 density vs region

The density variable seems a little unusual, so we look at how it distributes across the regions, since both appear to be geographic.

First we look at boxplots:

Then we look at a histogram facetted by region.

2.2.3 region vs car_age

We want to see a distribution of car_age by region in the data:

We may need to filter out cars that are exceptionally old.

2.3 Claim Data

First we look at a histogram of the individual claims without aggregating them by policy.

This does not tell us much due to the skewed nature of the claims, so we instead look at all claims below EUR 25,000:

Claims above 25,000 are sufficiently skewed that we look at these on a separate plot with a log scale on the x-axis.

To get a sense of the skew in terms of the right tail, we look at a cumulative density plot of the claim amounts:

2.3.1 Aggregate Claims by Policy

We now add up all the claims on a single policy and treat them as a single amount.
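A sketch of the aggregation and merge using data.table, on illustrative claim rows; policies with no claims get a zero total (as in the later summary, `num_claim` stays `NA` for claim-free policies):

```r
library(data.table)

# Illustrative claim rows: two claims on policy 1, one on policy 2
dt_claim <- data.table(policy_id    = c(1L, 1L, 2L),
                       claim_amount = c(1172, 1905, 1150))

# One row per policy: number of claims and total claim amount
claim_summary <- dt_claim[, .(num_claim    = .N,
                              total_claims = sum(claim_amount)),
                          by = policy_id]

# Left-join onto the policy table; claim-free policies get NA from the join
dt_policy <- data.table(policy_id = 1:3, claim_count = c(2L, 1L, 0L))
dt_policy <- merge(dt_policy, claim_summary, by = "policy_id", all.x = TRUE)

# Treat a missing total as zero claims paid
dt_policy[is.na(total_claims), total_claims := 0]
```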

Now we look at the total claims per policy.

We first check that the merge worked properly by ensuring that claim_count and num_claim are the same.

## Source: local data table [0 x 12]
## 
## # A tibble: 0 x 12
## # … with 12 variables: policy_id <int>, claim_count <int>, exposure <dbl>, power <fct>,
## #   car_age <int>, driver_age <int>, brand <fct>, fuel <fct>, region <fct>,
## #   density <int>, num_claim <int>, total_claims <dbl>

We look at the cumulative claims per policy.

2.3.2 total_claims by region

We do a boxplot of the total claims by region. We will first plot all the claims to see if there is a regional pattern in the larger claims, as we expect these amounts to dominate any visuals.

We now filter out the larger claims and do a boxplot for claims between 0 and 50,000.

2.3.3 Power-law Scaling

We look at the log-log plot of claim size against the cumulative number of claims of at least the size to investigate if the claim frequency obeys a power law.
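The curve can be computed as, for each observed claim size, the number of claims at least that large; a power law then appears as a straight line when both axes are on log scales. A self-contained sketch on illustrative amounts:

```r
# Illustrative claim amounts (the real vector has 16,181 claims)
claim_amounts <- c(61, 566, 1150, 1172, 1831, 1905, 4183, 55077)

# Complementary cumulative count: how many claims are at least this size
sorted_claims <- sort(claim_amounts)
n_at_least    <- length(sorted_claims):1

plot_data <- data.frame(log_claim = log10(sorted_claims),
                        log_count = log10(n_at_least))
```

Plotting `log_count` against `log_claim` and fitting a line over the larger claims reproduces the check described here.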

For claims above 1,000 (\(\log_{10} \text{Claim} = 3\)), a straight line could do a good job of fitting the curve, so we look at that.

## Warning: Removed 7 rows containing non-finite values (stat_smooth).

Encouraged by the above plots, we will model part of the claim distribution with a power law - most likely to model the likelihood of larger claims.

2.4 Univariate Plots Facetted by Claim

Now we split the data into two groups: those policies with no claims and those with at least one claim. We then create some univariate plots of the input data and facet on the claim/no-claim variable to get an idea of any differences between the two groups.

Now that we have this data, we repeat the earlier univariate plots of the categorical and numeric variables, faceting on whether or not the policies have had a claim. This allows us to make direct comparisons across the variables.
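A sketch of the split and faceting on toy data; the `has_claim` flag is a hypothetical name:

```r
library(ggplot2)

# Toy stand-in for the policy table
dt_policy <- data.frame(claim_count = rpois(500, 0.5),
                        driver_age  = sample(18:90, 500, replace = TRUE))

# Flag each policy by whether it has at least one claim
dt_policy$has_claim <- ifelse(dt_policy$claim_count > 0, "claim", "noclaim")

# Facet any univariate plot on the flag for direct comparison
p <- ggplot(dt_policy, aes(x = driver_age)) +
  geom_histogram(bins = 20) +
  facet_wrap(~has_claim, scales = "free_y")
print(p)
```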

As before, we start with the numeric variables first:

## claim_count

## exposure

## car_age

## driver_age

## density

Apart from the obvious distinction between claim counts, there appears to be very little difference across the two groups, so we take a look at the categorical variables.

## power

## brand

## fuel

## region

3 Data Cleaning

We now turn our attention to cleaning the data and creating features. We do not have any premium information for the policy data, and we may wish to convert some of the features from continuous to categorical - especially for variables such as age, where we expect a non-linear influence on the output.

We may exclude data from the analysis if they are outliers.

One issue with removing outliers at this stage is that we are not entirely sure what counts as an outlier. It may be safer for the moment to leave them in and perhaps filter them out just prior to modelling, when we have a better sense of what to do. We may wish to keep all the data for the moment and split the modelling tasks into different parts, capturing different aspects of the data in different ways.

For the moment, we will leave the data intact.

4 Feature Creation

We turn our attention to adding new variables to our dataset to assist with the modelling. Before we do this, we should look at a summary of the data.

##    policy_id       claim_count        exposure           power          car_age      
##  Min.   :     1   Min.   :0.0000   Min.   :0.00273   f      :95718   Min.   :  0.00  
##  1st Qu.:103293   1st Qu.:0.0000   1st Qu.:0.20000   g      :91198   1st Qu.:  3.00  
##  Median :206585   Median :0.0000   Median :0.54000   e      :77022   Median :  7.00  
##  Mean   :206585   Mean   :0.0392   Mean   :0.56109   d      :68014   Mean   :  7.53  
##  3rd Qu.:309877   3rd Qu.:0.0000   3rd Qu.:1.00000   h      :26698   3rd Qu.: 12.00  
##  Max.   :413169   Max.   :4.0000   Max.   :1.99000   j      :18038   Max.   :100.00  
##                                                      (Other):36481                   
##    driver_age                                  brand             fuel       
##  Min.   :18.0   Fiat                              : 16723   Diesel :205945  
##  1st Qu.:34.0   Japanese (except Nissan) or Korean: 79060   Regular:207224  
##  Median :44.0   Mercedes, Chrysler or BMW         : 19280                   
##  Mean   :45.3   Opel, General Motors or Ford      : 37402                   
##  3rd Qu.:54.0   other                             :  9866                   
##  Max.   :99.0   Renault, Nissan or Citroen        :218200                   
##                 Volkswagen, Audi, Skoda or Seat   : 32638                   
##      region          density        num_claim       total_claims    
##  R24    :160601   Min.   :    2   Min.   :1        Min.   :      0  
##  R11    : 69791   1st Qu.:   67   1st Qu.:1        1st Qu.:      0  
##  R53    : 42122   Median :  287   Median :1        Median :      0  
##  R52    : 38751   Mean   : 1985   Mean   :1        Mean   :     83  
##  R72    : 31329   3rd Qu.: 1410   3rd Qu.:1        3rd Qu.:      0  
##  R31    : 27285   Max.   :27000   Max.   :4        Max.   :2036833  
##  (Other): 43290                   NA's   :397779

From our initial data exploration in the previous sections, we have a few manipulations that may be worthwhile. We will bin some of the numeric variables, and we might combine a number of levels in some categorical variables to reduce the amount of work required.

4.1 Binning Continuous Variables

We bin a few of the continuous features that are unlikely to have any kind of linear effect on the response: driver_age, car_age and density.

We have picked a somewhat arbitrary set of cutoffs to discretise these three variables and will check their usefulness in the models we build.
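The binning can be done with base R's `cut()`; the interval endpoints below are inferred from the level labels visible in the data, and the outer endpoints are guesses:

```r
# driver_age: labels such as "(26,42]" and "(42,74]" appear in the data
driver_age     <- c(46L, 38L, 29L)
cat_driver_age <- as.character(cut(driver_age, breaks = c(17, 26, 42, 74, 100)))

# car_age: include.lowest = TRUE produces the "[0,1]" label
car_age     <- c(0L, 2L, 9L)
cat_car_age <- as.character(cut(car_age,
                                breaks = c(0, 1, 4, 10, 100),
                                include.lowest = TRUE))

# density: labels such as "(40,200]" and "(500,4.5e+03]" appear in the data
density     <- c(76L, 3003L)
cat_density <- as.character(cut(density,
                                breaks = c(0, 40, 200, 500, 4500, 30000)))
```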

## Observations: 413,169
## Variables: 15
## $ policy_id      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ claim_count    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ exposure       <dbl> 0.09, 0.84, 0.52, 0.45, 0.15, 0.75, 0.81, 0.05, 0.76, 0.34, 0.10…
## $ power          <fct> g, g, f, f, g, g, d, d, d, i, f, f, e, e, e, e, e, e, i, i, h, h…
## $ car_age        <int> 0, 0, 2, 2, 0, 0, 1, 0, 9, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ driver_age     <int> 46, 46, 38, 38, 41, 41, 27, 27, 23, 44, 32, 32, 33, 33, 33, 54, …
## $ brand          <fct> Japanese (except Nissan) or Korean, Japanese (except Nissan) or …
## $ fuel           <fct> Diesel, Diesel, Regular, Regular, Diesel, Diesel, Regular, Regul…
## $ region         <fct> R72, R72, R31, R31, R52, R52, R72, R72, R31, R11, R24, R24, R11,…
## $ density        <int> 76, 76, 3003, 3003, 60, 60, 695, 695, 7887, 27000, 23, 23, 1746,…
## $ num_claim      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ total_claims   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cat_driver_age <chr> "(42,74]", "(42,74]", "(26,42]", "(26,42]", "(26,42]", "(26,42]"…
## $ cat_car_age    <chr> "[0,1]", "[0,1]", "(1,4]", "(1,4]", "[0,1]", "[0,1]", "[0,1]", "…
## $ cat_density    <chr> "(40,200]", "(40,200]", "(500,4.5e+03]", "(500,4.5e+03]", "(40,2…

We will attempt to build models using both continuous and binned versions of this data and compare the performance of them all.

4.2 Aggregating Categorical Levels

A number of our categorical variables have long tails: a sizeable number of levels with small counts. This can cause issues, as parameter estimates for these levels may lack robustness and their uncertainty intervals are likely to be wide. To help with this, we often create a ‘catch-all’ level and assign to it all levels below a certain count.

4.2.1 power

There is quite a long tail for the higher letters, so we redo this plot showing the total count of the policies as we account for additional power levels. This should give us a sense of the point at which to aggregate the levels into a single value.

Looking at this plot, we see that the levels from \(i\) onwards can be aggregated. We combine \((i, k, l, m, n, o)\) into a single level, other.
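A base R sketch of the collapse (forcats::fct_other, which appears in the session info, offers an equivalent):

```r
# Collapse the sparse power levels into a single "other" level
power_levels_rare <- c("i", "k", "l", "m", "n", "o")

power     <- factor(c("d", "g", "i", "k", "m"))  # illustrative values
agg_power <- as.character(power)
agg_power[agg_power %in% power_levels_rare] <- "other"
agg_power <- factor(agg_power)
```

The agg_power_2 variant discussed next would simply drop `"i"` from `power_levels_rare`.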

We may have been slightly too aggressive with this, so just in case, we create a new variable agg_power_2 where we keep value \(i\) separate and aggregate the others.

agg_power_2 would appear to give a better aggregation of levels in terms of balanced counts - though it should be said that this balance is not necessarily good or desirable in itself.

6 R Environment

## ─ Session info ─────────────────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Ubuntu 18.10                
##  system   x86_64, linux-gnu           
##  ui       RStudio                     
##  language (EN)                        
##  collate  en_IE.UTF-8                 
##  ctype    en_IE.UTF-8                 
##  tz       Europe/Dublin               
##  date     2019-01-08                  
## 
## ─ Packages ─────────────────────────────────────────────────────────────────────────────
##  ! package      * version date       lib source        
##    assertthat     0.2.0   2017-04-11 [1] CRAN (R 3.5.1)
##    backports      1.1.3   2018-12-14 [1] CRAN (R 3.5.1)
##    bindr          0.1.1   2018-03-13 [1] CRAN (R 3.5.1)
##    bindrcpp       0.2.2   2018-03-29 [1] CRAN (R 3.5.1)
##    Boruta       * 6.0.0   2018-07-17 [1] CRAN (R 3.5.1)
##    broom          0.5.1   2018-12-05 [1] CRAN (R 3.5.1)
##    callr          3.1.1   2018-12-21 [1] CRAN (R 3.5.2)
##    CASdatasets  * 1.0-6   2018-11-20 [1] local         
##    cellranger     1.1.0   2016-07-27 [1] CRAN (R 3.5.1)
##    cli            1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
##    colorspace     1.3-2   2016-12-14 [1] CRAN (R 3.5.1)
##    crayon         1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
##    data.table   * 1.11.8  2018-09-30 [1] CRAN (R 3.5.1)
##    desc           1.2.0   2018-05-01 [1] CRAN (R 3.5.1)
##    devtools       2.0.1   2018-10-26 [1] CRAN (R 3.5.1)
##    digest         0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
##    dplyr        * 0.7.8   2018-11-10 [1] CRAN (R 3.5.1)
##    dtplyr       * 0.0.2   2017-04-21 [1] CRAN (R 3.5.1)
##    evaluate       0.12    2018-10-09 [1] CRAN (R 3.5.1)
##    fansi          0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
##  V feather      * 0.3.1   2019-01-07 [1] CRAN (R 3.5.2)
##    forcats      * 0.3.0   2018-02-19 [1] CRAN (R 3.5.1)
##    fortunes       1.5-4   2016-12-29 [1] CRAN (R 3.5.1)
##    fs             1.2.6   2018-08-23 [1] CRAN (R 3.5.1)
##    generics       0.0.2   2018-11-29 [1] CRAN (R 3.5.1)
##    GGally       * 1.4.0   2018-05-17 [1] CRAN (R 3.5.1)
##    ggplot2      * 3.1.0   2018-10-25 [1] CRAN (R 3.5.1)
##    glue           1.3.0   2018-07-17 [1] CRAN (R 3.5.1)
##    gtable         0.2.0   2016-02-26 [1] CRAN (R 3.5.1)
##    haven          2.0.0   2018-11-22 [1] CRAN (R 3.5.1)
##    hms            0.4.2   2018-03-10 [1] CRAN (R 3.5.1)
##    htmltools      0.3.6   2017-04-28 [1] CRAN (R 3.5.1)
##    httr           1.4.0   2018-12-11 [1] CRAN (R 3.5.1)
##    jsonlite       1.6     2018-12-07 [1] CRAN (R 3.5.1)
##    knitr          1.21    2018-12-10 [1] CRAN (R 3.5.1)
##    labeling       0.3     2014-08-23 [1] CRAN (R 3.5.1)
##    lattice        0.20-38 2018-11-04 [3] CRAN (R 3.5.1)
##    lazyeval       0.2.1   2017-10-29 [1] CRAN (R 3.5.1)
##    lubridate      1.7.4   2018-04-11 [1] CRAN (R 3.5.1)
##    magrittr       1.5     2014-11-22 [1] CRAN (R 3.5.1)
##    Matrix         1.2-15  2018-11-01 [1] CRAN (R 3.5.1)
##    memoise        1.1.0   2017-04-21 [1] CRAN (R 3.5.1)
##    modelr         0.1.2   2018-05-11 [1] CRAN (R 3.5.1)
##    munsell        0.5.0   2018-06-12 [1] CRAN (R 3.5.1)
##    nlme           3.1-137 2018-04-07 [3] CRAN (R 3.5.0)
##    packrat        0.5.0   2018-11-14 [1] CRAN (R 3.5.1)
##    pillar         1.3.1   2018-12-15 [1] CRAN (R 3.5.1)
##    pkgbuild       1.0.2   2018-10-16 [1] CRAN (R 3.5.1)
##    pkgconfig      2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
##    pkgload        1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
##    plyr           1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
##    prettyunits    1.0.2   2015-07-13 [1] CRAN (R 3.5.1)
##    processx       3.2.1   2018-12-05 [1] CRAN (R 3.5.1)
##    ps             1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
##    purrr        * 0.2.5   2018-05-29 [1] CRAN (R 3.5.1)
##    R6             2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
##    ranger       * 0.10.1  2018-06-04 [1] CRAN (R 3.5.1)
##    RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 3.5.1)
##    Rcpp           1.0.0   2018-11-07 [1] CRAN (R 3.5.1)
##    readr        * 1.3.1   2018-12-21 [1] CRAN (R 3.5.2)
##    readxl         1.2.0   2018-12-19 [1] CRAN (R 3.5.2)
##    remotes        2.0.2   2018-10-30 [1] CRAN (R 3.5.1)
##    reshape        0.8.8   2018-10-23 [1] CRAN (R 3.5.1)
##    reshape2       1.4.3   2017-12-11 [1] CRAN (R 3.5.1)
##    rlang          0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
##    rmarkdown      1.11    2018-12-08 [1] CRAN (R 3.5.1)
##    rprojroot      1.3-2   2018-01-03 [1] CRAN (R 3.5.1)
##    rstudioapi     0.8     2018-10-02 [1] CRAN (R 3.5.1)
##    rvest          0.3.2   2016-06-17 [1] CRAN (R 3.5.1)
##    scales         1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
##    sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
##    sp           * 1.3-1   2018-06-05 [1] CRAN (R 3.5.1)
##    stringi        1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
##    stringr      * 1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
##    testthat       2.0.1   2018-10-13 [1] CRAN (R 3.5.1)
##    tibble       * 2.0.0   2019-01-04 [1] CRAN (R 3.5.2)
##    tidyr        * 0.8.2   2018-10-28 [1] CRAN (R 3.5.1)
##    tidyselect     0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
##    tidyverse    * 1.2.1   2017-11-14 [1] CRAN (R 3.5.1)
##    usethis        1.4.0   2018-08-14 [1] CRAN (R 3.5.1)
##    utf8           1.1.4   2018-05-24 [1] CRAN (R 3.5.1)
##    withr          2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
##    xfun           0.4     2018-10-23 [1] CRAN (R 3.5.1)
##    xml2           1.2.0   2018-01-24 [1] CRAN (R 3.5.1)
##    xts          * 0.11-2  2018-11-05 [1] CRAN (R 3.5.1)
##    yaml           2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
##    zoo          * 1.8-4   2018-09-19 [1] CRAN (R 3.5.1)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/lib/R/site-library
## [3] /usr/lib/R/library
## 
##  V ── Loaded and on-disk version mismatch.